-------------------------------------------------------------------------------
Collection
-------------------------------------------------------------------------------

The corpus consists of three main packages, each corresponding to a specific collection effort:
names_orig, names and langs. The names_orig package was collected during
November and December 2009 for the purpose of building an initial AutoSecretary system.
This package contains 31 unique name-surname pairs read by at least 31 unique speakers.
To facilitate further experimentation on call routing system optimization, the
names and langs packages were subsequently collected from June to August 2012. The
prompts used during data collection were created by expanding a recognition grammar
that contained nine (9) unique name-surname pairs and eleven (11) South African language names.
All recordings were collected over a combination of PSTN telephone and mobile communication networks.
Because it was not possible to obtain details from all the contributing speakers,
no demographic information on the speaker population is released with the data.

-------------------------------------------------------------------------------
Directory structure
-------------------------------------------------------------------------------

The names_orig package is organised as follows:
 data
  |_ <package_id>
    |_ audio
      |_ test
        |_ *.wav
      |_ train
        |_ *.wav
    |_ transcriptions
      |_ test
        |_ *.txt
      |_ train
        |_ *.txt

The names and langs packages are organised as follows:
 data
  |_ <package_id>
    |_ audio
      |_ *.wav
    |_ transcriptions
      |_ *.txt

Each file ID has the form:
  pcrt_<package_id>_<speaker_id>_<utterance_id>.<ext>
where <package_id> is one of names_orig, names or langs, and <ext> is either txt or wav.
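
As an illustration of the naming scheme above, the following is a minimal
Python sketch that splits a file ID into its fields. It assumes that neither
<speaker_id> nor <utterance_id> contains an underscore or a dot, which this
document does not state. The names FileId and parse_file_id are hypothetical
and not part of the corpus.

```python
import re
from typing import NamedTuple, Optional


class FileId(NamedTuple):
    """Fields of a file ID (hypothetical helper type, not shipped with the corpus)."""
    package_id: str
    speaker_id: str
    utterance_id: str
    ext: str


# names_orig is listed before names so that the longer package name is tried first.
# Assumption: speaker and utterance IDs contain no underscore and no dot.
_FILE_ID = re.compile(
    r"pcrt_(names_orig|names|langs)_([^_.]+)_([^_.]+)\.(txt|wav)"
)


def parse_file_id(filename: str) -> Optional[FileId]:
    """Return the parsed fields, or None if the name does not follow the scheme."""
    match = _FILE_ID.fullmatch(filename)
    return FileId(*match.groups()) if match else None
```

Under these assumptions, a made-up name such as pcrt_names_orig_001_002.wav
would be split into the package names_orig, the speaker 001, the utterance 002
and the extension wav. Names that do not follow the scheme yield None.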

Each orthographic transcription (*.txt) consists of white-space delimited word tokens.
Each audio file (*.wav) contains the spoken realisation of the word tokens in the
transcription with the same file ID.
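
The following Python sketch shows one possible way to pair each audio file
with the word tokens of its transcription. It assumes that an audio file and
its transcription share the same file ID, and that the transcriptions are
UTF-8 encoded text; the encoding is an assumption, not a documented property
of the corpus. The function names are hypothetical. A recursive search is
used so that the same code covers the test/train split of names_orig and the
flat layout of names and langs.

```python
from pathlib import Path


def read_tokens(txt_path):
    """Split a transcription into its white-space delimited word tokens."""
    # Assumption: transcriptions are UTF-8 encoded.
    return Path(txt_path).read_text(encoding="utf-8").split()


def pair_utterances(package_dir):
    """Map each file ID to (audio path, word tokens) for one package directory."""
    package_dir = Path(package_dir)
    # Index the transcriptions by file ID (the file name without its extension).
    transcripts = {
        path.stem: path
        for path in (package_dir / "transcriptions").rglob("*.txt")
    }
    pairs = {}
    for wav in sorted((package_dir / "audio").rglob("*.wav")):
        txt = transcripts.get(wav.stem)
        if txt is not None:  # audio files without a transcription are skipped
            pairs[wav.stem] = (wav, read_tokens(txt))
    return pairs
```

For example, pair_utterances("data/names") would return a dictionary keyed by
file ID, given that the corpus has been unpacked under a local directory
named data.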

-------------------------------------------------------------------------------

The main directory (data) contains four folders:
  data
    |_ info
    |_ langs
    |_ names
    |_ names_orig

The info folder contains four additional resources:
  info
    |_concept_maps
      |_langs.map
      |_names.map
    |_dictionary
      |_pcrt.dict
    |_grammars
      |_langs.grammar
      |_names.grammar
    |_metadata
      |_meta_langs.txt
      |_meta_names.txt
      |_meta_names_orig_test.txt
      |_meta_names_orig_train.txt

The concept_maps directory contains the word-level recognition alternatives
for the name and language concepts of the names and langs packages.
The pcrt.dict file contains pronunciations for all word tokens in the corpus.
The grammars directory provides the word recognition grammars for the
names and langs packages. Finally, the metadata directory contains
all additional metadata for the three packages included in the corpus.
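
As a quick completeness check on a local copy, the following Python sketch
reports which of the info resources listed above are missing. The file list
is taken from the tree shown in this document; the function name
missing_info_files and the location of the info folder on disk are
assumptions.

```python
from pathlib import Path

# Relative paths of the info resources, as listed in this document.
EXPECTED_INFO_FILES = [
    "concept_maps/langs.map",
    "concept_maps/names.map",
    "dictionary/pcrt.dict",
    "grammars/langs.grammar",
    "grammars/names.grammar",
    "metadata/meta_langs.txt",
    "metadata/meta_names.txt",
    "metadata/meta_names_orig_test.txt",
    "metadata/meta_names_orig_train.txt",
]


def missing_info_files(info_dir):
    """Return the expected info resources that are not present under info_dir."""
    info_dir = Path(info_dir)
    return [rel for rel in EXPECTED_INFO_FILES if not (info_dir / rel).is_file()]
```

An empty list from missing_info_files("data/info") indicates that all nine
files are in place.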

-------------------------------------------------------------------------------

